Image Segmentation Using Expectation–Maximization and Gaussian Mixture Models

Find this project on GitHub

Introduction

Image segmentation is a fundamental task in computer vision as it aims to partition an image into meaningful regions. Instead of treating every pixel independently, segmentation groups pixels with similar characteristics, allowing higher-level understanding of visual data.

This project develops an unsupervised image segmentation method based on the Expectation–Maximization (EM) algorithm applied to a Gaussian Mixture Model (GMM). Each pixel is represented by its RGB color values, and clustering is used to group pixels into color-based segments.

The main objectives of this project are:

  • Implement the EM algorithm for Gaussian mixtures from scratch
  • Segment images into 10, 20, and 50 clusters
  • Analyze how initialization affects segmentation
  • Study variation across multiple runs and images

Each image is treated as a dataset where each pixel represents one data point and the features are its RGB values. Therefore, the data dimension is $d = 3$ and the dataset size is $N$, the number of pixels.
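As a minimal sketch (with hypothetical image dimensions), an image array of shape height × width × 3 can be flattened into this $N \times d$ data matrix with a single reshape:

```python
import numpy as np

# Hypothetical example: a small 4x5 RGB image flattened into the N x d data matrix.
height, width = 4, 5
image = np.random.randint(0, 256, size=(height, width, 3), dtype=np.uint8)

# Each row of X is one pixel's RGB vector, so N = height * width and d = 3.
X = image.reshape(-1, 3).astype(np.float64)

print(X.shape)  # (20, 3)
```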

The pixel colors are assumed to be generated from a mixture of Gaussian distributions: $$ p(x_i) = \sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_i \mid \mu_j, I) $$ where:

  • $x_i$: RGB vector of pixel $i$
  • $K$: Number of clusters
  • $\pi_j$: Mixing weight (prior probability) of cluster $j$
  • $\mu_j$: Mean (centroid) vector of cluster $j$

The covariance of every component is fixed to the identity matrix $I$, so clusters differ only in their means and weights.
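As a small illustration with toy numbers (not from the project), the mixture density with identity covariance can be evaluated directly; the helper mixture_density below is hypothetical:

```python
import numpy as np

def mixture_density(x, pi, mu):
    """Evaluate p(x) = sum_j pi_j * N(x | mu_j, I) for a single point x."""
    d = x.shape[0]
    # Squared Euclidean distance from x to each centroid.
    sq_dist = np.sum((mu - x) ** 2, axis=1)
    # Gaussian density with identity covariance.
    densities = np.exp(-0.5 * sq_dist) / (2 * np.pi) ** (d / 2)
    return float(pi @ densities)

# Toy two-component mixture in d = 3 (RGB) space.
pi = np.array([0.6, 0.4])
mu = np.array([[255.0, 0.0, 0.0],
               [0.0, 0.0, 255.0]])
x = np.array([255.0, 0.0, 0.0])  # a pure-red pixel, right on the first centroid
p = mixture_density(x, pi, mu)
```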

Technical Details

Matrix Representation of the Gaussian Mixture Model

To efficiently implement the EM equations in Python, all variables are stored as matrices. This allows vectorized computations using NumPy and avoids computationally expensive Python loops. For simplicity, the iteration superscript $^{(n)}$ is omitted and the parameters are written directly as $X$, $W$, $\mu$, and $\pi$.

Let $$ X = \begin{bmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_N^T \end{bmatrix} $$ be the $N$ × $d$ data matrix containing the input pixels.

The cluster centroids are stored in $$ \mu = \begin{bmatrix} \mu_1^T \\ \mu_2^T \\ \vdots \\ \mu_K^T \end{bmatrix} $$ with dimensions $K$ × $d$.

The mixture weights are represented as the column vector $$ \pi = \begin{bmatrix} \pi_1 \\ \pi_2 \\ \vdots \\ \pi_K \end{bmatrix} $$

The Expectation Step

The goal of the Expectation (E-step) is to compute the probability that each pixel belongs to each cluster given the current model parameters. The mathematical derivation of the update equations is provided in the accompanying document. In this report, the focus is only on how these formulas are implemented computationally.

First, the function find_H computes the pairwise distance matrix

$$H_{i,j} := -\frac{1}{2} \bigg[ (\mathbf{x}_i-\mathbf{\mu}_j)^T(\mathbf{x}_i-\mathbf{\mu}_j) \bigg]$$

An equivalent matrix decomposition is used:

$$ H = -\frac{1}{2}\left(H_1 + H_2 - 2H_3\right) $$

where $[H_1]_{i,j} = \mathbf{x}_i^T\mathbf{x}_i$ holds the squared norms of the pixels, $[H_2]_{i,j} = \mathbf{\mu}_j^T\mathbf{\mu}_j$ holds the squared norms of the centroids, and $H_3 = X\mu^T$ contains the cross terms. This formulation enables computation using matrix multiplications, significantly accelerating processing for large images.
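A quick sanity check of this decomposition, on small random data, compares the vectorized form against the definition computed with explicit loops:

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, d = 6, 3, 3
X = rng.normal(size=(N, d))
mu = rng.normal(size=(K, d))

# Vectorized decomposition: H1 holds the squared norms of the rows of X,
# H2 the squared norms of the rows of mu, H3 the cross terms X mu^T.
H1 = np.square(X) @ np.ones((d, 1)) @ np.ones((1, K))
H2 = np.ones((N, d)) @ np.square(mu).T
H3 = X @ mu.T
H_fast = -0.5 * (H1 + H2 - 2 * H3)

# Direct definition: H[i, j] = -1/2 * ||x_i - mu_j||^2.
H_loop = np.empty((N, K))
for i in range(N):
    for j in range(K):
        diff = X[i] - mu[j]
        H_loop[i, j] = -0.5 * diff @ diff
```

The two results agree to floating-point precision, while the vectorized version scales to millions of pixels.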

The function takes $X$ and $\mu$ as inputs and returns $H$ as a NumPy array of shape $(N, K)$.

import numpy as np

def find_H(X, mu):
    N = X.shape[0]
    K = mu.shape[0]
    assert X.shape[1] == mu.shape[1]
    d = X.shape[1]
    # H1[i, j] = ||x_i||^2, H2[i, j] = ||mu_j||^2, H3 = X mu^T
    H1 = np.square(X) @ np.ones((d, 1)) @ np.ones((1, K))
    H2 = np.ones((N, d)) @ np.square(mu).T
    H3 = X @ mu.T
    H = -0.5 * (H1 + H2 - 2 * H3)
    assert H.shape == (N, K)
    assert H.dtype == np.float64
    return H

After computing $H$, the posterior responsibilities are evaluated in log space for numerical stability. The function find_logW calculates the matrix $\log W$ using:

  • $H$: The $(N, K)$ array returned by the find_H function above.
  • log_pi: A NumPy array of shape $(K, 1)$ containing the element-wise natural log of the prior probability vector $\pi$.

The resulting matrix satisfies

$$\log W_{i,j} := \log\bigg(\frac{\pi_j \exp\bigg(-\frac{1}{2} \bigg[ (\mathbf{x}_i-\mathbf{\mu}_j)^T(\mathbf{x}_i-\mathbf{\mu}_j) \bigg]\bigg)}{\sum_{l=1}^{K} \pi_l \exp\bigg(-\frac{1}{2} \bigg[ (\mathbf{x}_i-\mathbf{\mu}_l)^T(\mathbf{x}_i-\mathbf{\mu}_l) \bigg]\bigg)}\bigg)$$

Using the simplified formulation derived in the accompanying document, we compute $$ \log(E) = H + \log(\pi) $$ and $$ \log(F)_i = \log \sum_{l=1}^{K} \exp\big(\log(E_{i,l})\big) $$ The log responsibilities are then obtained as $$ \log(W) = \log(E) - \log(F) $$

The resulting matrix logW is then passed to the Maximization step.

from scipy.special import logsumexp

def find_logW(H, log_pi):
    N, K = H.shape
    # log E = H + log pi (broadcast across rows), then normalize each row
    # with a log-sum-exp so that the responsibilities sum to one.
    log_e = H + log_pi.reshape((1, -1))
    logW = log_e - logsumexp(log_e, axis=1).reshape((N, 1))
    assert logW.shape == (N, K)
    assert not np.isnan(logW).any()
    assert not np.isinf(logW).any()
    return logW
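To see why the log-space normalization matters, consider a sketch with a typical magnitude for $H$: raw RGB distances make every $\exp$ underflow to zero, while the log-sum-exp route stays exact. The numbers below are illustrative only:

```python
import numpy as np
from scipy.special import logsumexp

# Typical H entries for raw RGB pixels are hugely negative (-0.5 * squared
# distances), so exponentiating directly underflows to zero.
log_e = np.array([[-20000.0, -20050.0, -20100.0]])

# Naive normalization fails: every exp() underflows to 0, so we divide 0 by 0.
with np.errstate(invalid='ignore'):
    naive = np.exp(log_e) / np.exp(log_e).sum()  # -> nan

# Log-space normalization stays exact: logsumexp shifts by the row max first.
logW = log_e - logsumexp(log_e, axis=1, keepdims=True)
W = np.exp(logW)  # rows sum to 1, dominated by the nearest centroid
```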

The Maximization Step

In the Maximization (M-step), the model parameters are updated using the posterior responsibilities obtained from the E-step. Again, all computations are performed in matrix form.

First, the function update_logpi computes

$$\log \pi^{\text{new}}_{j} := \log \frac{\sum_{i=1}^N W_{i,j}}{N}$$ or, in vector form, $$ \log \boldsymbol{\pi}^{\,\text{new}} = \log\left( W^T \mathbf{1}_N \right) - \log N. $$

def update_logpi(logW):
    N, K = logW.shape
    # Sum the responsibilities over all pixels in log space, then divide by N.
    log_pi = logsumexp(logW.T, axis=1).reshape((K, 1)) - np.log(N)
    assert log_pi.shape == (K, 1)
    assert not np.isnan(log_pi).any()
    assert not np.isinf(log_pi).any()
    return log_pi

Next, the function update_mu computes the updated centroid matrix:

$$\mathbf{\mu}_j^{new} = \frac{\sum_{i=1}^N \mathbf{x_i} W_{i,j}}{\sum_{i=1}^N W_{i,j}}$$

Using the simplified formulation $$ R = W^{T} = \exp(\log W^{T}), $$ the matrix update becomes $$\mu = (RX)\ \oslash\ (R\,\mathbf{1}_{N\times d}), $$

where $\oslash$ denotes element-wise division and $1_{N×d}$ broadcasts the normalization across dimensions.

def update_mu(X, logW):
    N, K = logW.shape
    d = X.shape[1]
    assert X.shape[0] == N
    # R[j, i] = W[i, j]: responsibilities transposed for the matrix products.
    R = np.exp(logW.T)
    numerator = R @ X                   # responsibility-weighted pixel sums, (K, d)
    denominator = R @ np.ones((N, d))   # total responsibility per cluster, (K, d)
    mu = numerator / denominator
    assert mu.shape == (K, d)
    assert not np.isnan(mu).any()
    assert not np.isinf(mu).any()
    return mu

Expectation-Maximization Iteration

Finally, the function GMM executes the full EM procedure using the previously defined updates. It initializes the model parameters and then alternates the E-step (computing $H$ and $\log W$) and the M-step (updating $\log \boldsymbol{\pi}$ and $\mu$) for a fixed number of iterations, after which the final parameter estimates are returned.

from sklearn.cluster import KMeans

def GMM(X, K, initialization_method='kmeans', iterations=100, seed=12345):
    N, d = X.shape
    pi_init = np.ones((K, 1)) / float(K)

    np_random = np.random.RandomState(seed=seed)
    if initialization_method == 'random_pixels':
        # Pick K random pixels as the initial centroids.
        mu_init = X[np_random.choice(N, K), :]           # mu.shape = (K, d)
    elif initialization_method == 'kmeans':
        kmeans = KMeans(n_clusters=K, random_state=np_random).fit(X)
        mu_init = kmeans.cluster_centers_                # mu.shape = (K, d)
    else:
        raise ValueError(f'Unknown initialization method: {initialization_method}')

    log_pi = np.log(pi_init)                             # log_pi.shape = (K, 1)
    mu = mu_init

    for iteration in range(iterations):
        print('.', end='')
        # The E-step
        H = find_H(X, mu)
        logW = find_logW(H, log_pi)

        # The M-step
        log_pi = update_logpi(logW)
        mu = update_mu(X, logW)
    print('', end=' ')

    # Recompute H once more so the returned assignments match the final centroids.
    H = find_H(X, mu)

    return mu, H, log_pi

Image Segmentation and Reconstruction

The function segment reconstructs a segmented image using the parameters learned by the GMM procedure. The image is first reshaped into a data matrix $$ X \in \mathbb{R}^{N \times 3}, $$ where each row corresponds to one pixel and $N$ is the total number of pixels. After calling GMM, each pixel is assigned to the cluster with maximum responsibility: $$ z_i = \arg\max_{j} H_{i,j}. $$ The reconstructed image is obtained by replacing every pixel with its assigned centroid color: $$ \hat{\mathbf{x}}_i = \mu_{z_i}. $$ Finally, the pixel matrix is reshaped back to the original image dimensions for visualization alongside the original image. Because the function takes $K$ and initialization_method as arguments, the same implementation can produce multiple segmentation levels (different values of $K$) as well as different initialization strategies, enabling direct comparison of clustering outcomes under varying model configurations.

import matplotlib.pyplot as plt

def segment(raw_image, K, initialization_method='random_pixels', seed=12345):
    mu, H, log_pi = GMM(X=raw_image.reshape(-1, 3), K=K,
                        initialization_method=initialization_method,
                        iterations=10, seed=seed)
    # Replace each pixel with the color of its most responsible centroid.
    reconst_image = mu[H.argmax(axis=1), :].reshape(*raw_image.shape)

    fig, axes = plt.subplots(1, 2, figsize=(8, 4), dpi=144)

    ax = axes[0]
    ax.imshow(raw_image)
    ax.set_xticks([])
    ax.set_yticks([])
    ax.set_title('Original Image')

    ax = axes[1]
    ax.imshow(reconst_image.astype(np.uint8))
    ax.set_xticks([])
    ax.set_yticks([])
    ax.set_title(f'Segmented Image (K={K})')

    fig.tight_layout()

Analysis and Model Evaluation

Effect of the Number of Segments $K$

The impact of the number of clusters was evaluated using the values introduced earlier: 10, 20, and 50. A photograph of my cat Miel was used for this comparison. The image contains very muted colors, primarily white and orange fur, along with sage and white sheets in the background.

For $K=10$, the segmentation produces strong color compression. Many subtle tones are merged together, and the image appears almost black and white, as the pale colors dominate the clustering process. Fine variations in shading are largely removed.

When increasing to $K=20$, the original colors become more visible; however, shadows and contrasts appear overly sharp compared to the original image. The segmentation begins capturing more detail but still exaggerates boundaries between regions.

With $K=50$, the reconstruction closely resembles the original photograph. Color transitions are smoother, and subtle variations in fur texture and fabric shading are preserved. At this level, the model has sufficient flexibility to represent the muted color palette without excessive quantization.

Overall, increasing $K$ improves visual fidelity by allowing more precise color representation, though at the cost of increased computation.

A photograph from Lofoten, Norway was analyzed. The image contains strong color contrasts, including sky tones, water reflections, and brightly colored houses. Similar behavior to the Miel experiment was observed: smaller values of $K$ oversimplify the scene, while larger values preserve reflections and color gradients more effectively.

Because the Gaussian Mixture Model clusters pixels purely in RGB space, the algorithm groups pixels based on color similarity rather than spatial proximity. As a result, reflections in the water are often assigned to the same clusters as their corresponding objects, demonstrating consistent color-based segmentation across different regions of the image.

Effect of Initialization Method

To study initialization effects, a different image was used featuring my cat Milo, a calico cat with a more complex color distribution. The scene also includes a light pink couch, introducing additional color variation. Segmentation was performed with $K=50$ using the two available initialization strategies: kmeans and random_pixels.

The K-means initialization required approximately 2.9 seconds, while the random pixel initialization completed slightly faster at 2.4 seconds. Despite this difference in runtime, visual differences between the resulting segmentations were difficult to assess. Both outputs produced coherent and visually accurate reconstructions.

Random initialization:

K-means initialization:

An additional experiment was conducted using an aerial photograph of a neighborhood in Bucaramanga. The image contains many small structures with similar materials, including terracotta roofs, white and grey cement houses, and occasional brightly colored homes. This setting introduces a large number of visually similar regions with subtle color differences.

For $K=10$, both initialization methods produced nearly identical results. At this level of segmentation, clusters represent broad color groups, and initialization has little influence on the final solution.

However, clear differences emerge when increasing the number of clusters to $K=50$. With kmeans initialization, the segmentation preserves smaller color variations more effectively. In particular, the occasional yellow homes remain visible and distinct. In contrast, the random_pixels initialization tends to merge these less frequent colors into larger dominant clusters, causing some colored structures to disappear in the reconstruction.

kmeans initialization provides centroids that are already distributed across meaningful color regions, allowing the algorithm to better capture rare color groups. Random initialization may place initial centroids in dominant color areas, making it harder for smaller clusters to emerge during optimization.

This experiment demonstrates that initialization becomes increasingly important as $K$ grows and as image color distributions become more complex.

Random initialization:

K-means initialization:

Random initialization:

K-means initialization:


Conclusions

This project implements an image segmentation algorithm developed from scratch, based on probabilistic clustering formulated entirely in log-space to ensure numerical stability. The derivation leads to normalized pixel responsibilities computed through log-sum-exp operations, enabling reliable estimation even for images containing millions of pixels.

A key focus of the analysis is the impact of initialization strategies on segmentation performance. For lower cluster counts, random initialization and K-means initialization produce comparable results. However, as model complexity increases, K-means initialization significantly improves convergence and visual coherence. This effect is particularly evident in aerial imagery of a Bucaramanga neighborhood, where subtle color variations—especially distinctive yellow houses—are successfully preserved only with structured initialization.

Performance considerations played an important role in the workflow. Images captured on an iPhone were too large for efficient experimentation and had to be compressed prior to processing. While compression reduced computational cost and memory usage, it also altered color distributions and fine texture details, highlighting a practical trade-off relevant to real-world implementations where preprocessing decisions directly influence segmentation outcomes.
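As a crude NumPy-only sketch of that preprocessing step (simple pixel subsampling rather than the proper resampling a photo editor would apply; the image dimensions are hypothetical):

```python
import numpy as np

# Hypothetical preprocessing: shrink a large photo before running EM,
# since each EM iteration scales linearly with the number of pixels N.
big = np.random.randint(0, 256, size=(3024, 4032, 3), dtype=np.uint8)

# Keep every 4th pixel in each direction: 16x fewer data points.
raw_image = big[::4, ::4]

X = raw_image.reshape(-1, 3)  # data matrix for GMM, now much smaller
```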

The experiments further show that algorithm suitability depends strongly on image complexity and color palette diversity. Scenes with homogeneous materials or limited color variation segment reliably under multiple configurations, whereas heterogeneous urban environments benefit from better initialization and higher cluster resolution to capture meaningful structures.

Also, an important observation: my cats are very cute :)